NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

What Should We Engineer in Prompts? Training Humans in Requirement-Driven LLM Use

https://doi.org/10.1145/3731756

Ma, Qianou; Peng, Weirui; Yang, Chenyang; Shen, Hua; Koedinger, Kenneth; Wu, Tongshuang (April 2025, ACM Transactions on Computer-Human Interaction)

Prompting LLMs for complex tasks (e.g., building a trip advisor chatbot) needs humans to clearly articulate customized requirements (e.g., “start the response with a tl;dr”). However, existing prompt engineering instructions often lack focused training on requirement articulation and instead tend to emphasize increasingly automatable strategies (e.g., tricks like adding role-plays and “think step-by-step”). To address the gap, we introduce Requirement-Oriented Prompt Engineering (ROPE), a paradigm that focuses human attention on generating clear, complete requirements during prompting. We implement ROPE through an assessment and training suite that provides deliberate practice with LLM-generated feedback. In a randomized controlled experiment with 30 novices, ROPE significantly outperforms conventional prompt engineering training (20% vs. 1% gains), a gap that automatic prompt optimization cannot close. Furthermore, we demonstrate a direct correlation between the quality of input requirements and LLM outputs. Our work paves the way to empower more end-users to build complex LLM applications.
more » « less
Free, publicly-accessible full text available April 24, 2026
Synergi: A Mixed-Initiative System for Scholarly Synthesis and Sensemaking

https://doi.org/10.1145/3586183.3606759

Kang, Hyeonsu B; Wu, Tongshuang; Chang, Joseph Chee; Kittur, Aniket (October 2023, ACM)

Full Text Available
Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses

Stapleton, Logan; Taylor, Jordan; Fox, Sarah; Wu, Tongshuang; Zhu, Haiyi. (July 2023, ICML Workshop)

Large generative AI models (GMs) like GPT and DALL-E are trained to generate content for general, wide-ranging purposes. GM content filters are generalized to filter out content which has a risk of harm in many cases, e.g., hate speech. However, prohibited content is not always harmful -- there are instances where generating prohibited content can be beneficial. So, when GMs filter out content, they preclude beneficial use cases along with harmful ones. Which use cases are precluded reflects the values embedded in GM content filtering. Recent work on red teaming proposes methods to bypass GM content filters to generate harmful content. We coin the term green teaming to describe methods of bypassing GM content filters to design for beneficial use cases. We showcase green teaming by: 1) Using ChatGPT as a virtual patient to simulate a person experiencing suicidal ideation, for suicide support training; 2) Using Codex to intentionally generate buggy solutions to train students on debugging; and 3) Examining an Instagram page using Midjourney to generate images of anti-LGBTQ+ politicians in drag. Finally, we discuss how our use cases demonstrate green teaming as both a practical design method and a mode of critique, which problematizes and subverts current understandings of harms and values in generative AI.
more » « less
Beyond Testers’ Biases: Guiding Model Testing with Knowledge Bases using LLMs

https://doi.org/10.18653/v1/2023.findings-emnlp.901

Yang, Chenyang; Rustogi, Rishabh; Brower-Sinning, Rachel; Lewis, Grace; Kaestner, Christian; Wu, Tongshuang (January 2023, Association for Computational Linguistics)

Full Text Available
StoryBuddy: A Human-AI Collaborative Chatbot for Parent-Child Interactive Storytelling with Flexible Parental Involvement

https://doi.org/10.1145/3491102.3517479

Zhang, Zheng; Xu, Ying; Wang, Yanhao; Yao, Bingsheng; Ritchie, Daniel; Wu, Tongshuang; Yu, Mo; Wang, Dakuo; Li, Toby Jia-Jun (April 2022, Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems)

Despite its benefits for children’s skill development and parent-child bonding, many parents do not often engage in interactive storytelling by having story-related dialogues with their child due to limited availability or challenges in coming up with appropriate questions. While recent advances made AI generation of questions from stories possible, the fully-automated approach excludes parent involvement, disregards educational goals, and underoptimizes for child engagement. Informed by need-finding interviews and participatory design (PD) results, we developed StoryBuddy, an AI-enabled system for parents to create interactive storytelling experiences. StoryBuddy’s design highlighted the need for accommodating dynamic user needs between the desire for parent involvement and parent-child bonding and the goal of minimizing parent intervention when busy. The PD revealed varied assessment and educational goals of parents, which StoryBuddy addressed by supporting configuring question types and tracking child progress. A user study validated StoryBuddy’s usability and suggested design insights for future parent-AI collaboration systems.
more » « less
Full Text Available
Fantastic Questions and Where to Find Them: FairytaleQA – An Authentic Dataset for Narrative Comprehension

https://doi.org/10.18653/v1/2022.acl-long.34

Xu, Ying; Wang, Dakuo; Yu, Mo; Ritchie, Daniel; Yao, Bingsheng; Wu, Tongshuang; Zhang, Zheng; Li, Toby; Bradford, Nora; Sun, Branda; et al (April 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models’ fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.
more » « less
Full Text Available
Tempura: Query Analysis with Structural Templates

https://doi.org/10.1145/3313831.3376451

Wu, Tongshuang; Wongsuphasawat, Kanit; Ren, Donghao; Patel, Kayur; DuBois, Chris (April 2020, Human Factors in Computing Systems)

Analyzing queries from search engines and intelligent assistants is difficult. A key challenge is organizing queries into interpretable, context-preserving, representative, and flexible groups. We present structural templates, abstract queries that replace tokens with their linguistic feature forms, as a query grouping method. The templates allow analysts to create query groups with structural similarity at different granularities. We introduce Tempura, an interactive tool that lets analysts explore a query dataset with structural templates. Tempura summarizes a query dataset by selecting a representative subset of templates to show the query distribution. The tool also helps analysts navigate the template space by suggesting related templates likely to yield further explorations. Our user study shows that Tempura helps analysts examine the distribution of a query dataset, find labeling errors, and discover model error patterns and outliers.
more » « less
Full Text Available
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

https://doi.org/10.18653/v1/2020.acl-main.442

Ribeiro, Marco Tulio; Wu, Tongshuang; Guestrin, Carlos; Singh, Sameer (January 2020, Annual Meeting of the Association for Computational Linguistics)

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
more » « less
Full Text Available
Local Decision Pitfalls in Interactive Machine Learning: An Investigation into Feature Selection in Sentiment Analysis

https://doi.org/10.1145/3319616

Wu, Tongshuang; Weld, Daniel S.; Heer, Jeffrey (July 2019, ACM Transactions on Computer-Human Interaction)

Full Text Available
No Explainability without Accountability: An Empirical Study of Explanations and Feedback in Interactive ML

https://doi.org/10.1145/3313831.3376624

Smith-Renner, Alison; Fan, Ron; Birchfield, Melissa; Wu, Tongshuang; Boyd-Graber, Jordan; Weld, Daniel S.; Findlater, Leah (January 2020, CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems)
null (Ed.)
Automatically generated explanations of how machine learning (ML) models reason can help users understand and accept them. However, explanations can have unintended consequences: promoting over-reliance or undermining trust. This paper investigates how explanations shape users' perceptions of ML models with or without the ability to provide feedback to them: (1) does revealing model flaws increase users' desire to "fix" them; (2) does providing explanations cause users to believe - wrongly - that models are introspective, and will thus improve over time. Through two controlled experiments - varying model quality - we show how the combination of explanations and user feedback impacted perceptions, such as frustration and expectations of model improvement. Explanations without opportunity for feedback were frustrating with a lower quality model, while interactions between explanation and feedback for the higher quality model suggest that detailed feedback should not be requested without explanation. Users expected model correction, regardless of whether they provided feedback or received explanations.
more » « less
Full Text Available

« Prev Next »

Search for: All records